LONG-RANGE CONSTRAINTS IN THE STATISTICAL
STRUCTURE OF PRINTED ENGLISH 


By N. G. Burton and J. C. R. Licklider, Massachusetts Institute of Technology


The sequences of letters that constitute printed English are among the most 
intricate and also among the most important patterns produced by human beings. 
As a characteristic of language behavior, the stochastic structure of letter sequences 
is therefore of special interest to psychologists. 

Redundancy is an essential feature of the stochastic structure. The sequences are 
redundant in the sense that letters in one part limit the possibilities and influence 
the probabilities of letters in another part. If a literate subject is given a succession 
of 10 letters, the constraints of spelling, grammar, syntax, and idiom make it possible 
for him to predict with some accuracy what the eleventh will be. 

The ordering of letters is so complex, actually, that the only statistical machine 
that can deal adequately with the problem is the human verbal mechanism. For this 
reason, although the study of redundancy in samples of printed English might in 
principle be restricted to counting letter sequences and making calculations on the 
basis of relative frequencies, practicable methods of investigation are as
psychological as the problem.

Working on the assumption that an intelligent human being knows the probability
structure of his language well enough to approximate the performance of an ideal
predictor, Shannon developed a technique for making quantitative estimates of the
redundancy of sequences of letters.¹ Showing S the first n-1 letters of an n-letter
passage, he had S make ‘educated guesses’ until the correct nth letter was named.
This procedure was repeated over and over with values of n running from 1 to 15
and also with n = 100. From the distribution of the number of guesses required in
repeated trials by several Ss, Shannon determined upper and lower bounds on the
entropy or uncertainty or ‘amount of information’ in printed English.
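
The step from guess counts to entropy bounds can be sketched in a few lines. The sketch below assumes the two formulas given in Shannon's 1951 paper: with q_i the fraction of trials in which the correct letter was named on the ith guess, the upper bound is -sum(q_i log2 q_i) and the lower bound is sum(i (q_i - q_(i+1)) log2 i). The function names and the sample data are illustrative assumptions, not part of the original study.

```python
import math
from collections import Counter


def guess_frequencies(guess_counts):
    """Turn a list of 'number of guesses needed' into frequencies q_1, q_2, ...

    guess_counts[k] is the number of guesses S needed on trial k (1 = right
    on the first try). The result is a list whose element i-1 is the
    fraction of trials solved on exactly the i-th guess.
    """
    tally = Counter(guess_counts)
    total = len(guess_counts)
    return [tally.get(i, 0) / total for i in range(1, max(tally) + 1)]


def entropy_bounds(q):
    """Return (lower, upper) bounds in bits per letter.

    Assumes Shannon's formulas. The lower bound is derived for an ideal
    predictor, whose q_i do not increase with i; with small samples the
    observed q_i may need smoothing before this bound is trusted.
    """
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    padded = list(q) + [0.0]  # q_(i+1) past the last observed rank is zero
    lower = sum(
        i * (padded[i - 1] - padded[i]) * math.log2(i)
        for i in range(1, len(q) + 1)
    )
    return lower, upper


if __name__ == "__main__":
    # Hypothetical guess counts from 10 trials at one context length.
    trials = [1, 1, 1, 2, 1, 3, 1, 1, 2, 1]
    low, high = entropy_bounds(guess_frequencies(trials))
    print(f"entropy between {low:.3f} and {high:.3f} bits per letter")
```

When every guess rank is equally likely the two bounds coincide, which is one quick sanity check on an implementation.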

The amount of constraint exerted by the context is best shown by the relative 
redundancy, which is one minus the ratio of the actual amount of information per 
letter to the amount (4.7 bits) that could be transmitted if the letters in the 
sequence were selected independently of one another. According to Shannon's data, 
a letter is, on the average, at least three-quarters determined by what has preceded 
it; less than one quarter of its information is new. The question is, is all that
constraint upon the selection of a given letter imposed by letters near it in the
sequence, or do distant influences have an appreciable effect? This is what is hard
to tell from Shannon’s data. The curves, plotted from Shannon’s results to show
relative redundancy as a function of the number (n) of letters in the test-sequence,
rise noticeably between n = 15 and n = 100, but the course of increase between these


* Accepted for publication September 3, 1954. This research was supported in 
part under Air Force Contract AF 18(600)-322, monitored by the Operational 
Applications Laboratory, Air Force Cambridge Research Center, Air Research and 
Development Command. 

¹ C. E. Shannon, Prediction and entropy in printed English, Bell Syst. Tech. J.,
30, 1951, 50-64.



two points cannot be determined, and the function therefore cannot be extrapolated.
There is, consequently, no rigorous basis for deciding whether the estimate of
redundancy based on n = 100 is a good figure for the total redundancy, or whether
printed English might turn out to be, say, 95% redundant if very-long-range
constraints were taken into account. Since long-range constraints include the
influences of subject matter, style, level of presentation, and the dynamics of the
situation reported or described, it is possible a priori that they might be quite
strong. To estimate their strength, and so to obtain a more complete picture of the
course of the curve relating estimated relative redundancy to sample length, is the
object of the present study.
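
The definition of relative redundancy used above reduces to one line of arithmetic, sketched here for concreteness. The constant 4.7 bits is the figure quoted in the text; the function names are illustrative assumptions.

```python
MAX_BITS_PER_LETTER = 4.7  # value quoted in the text for independent letters


def relative_redundancy(bits_per_letter, max_bits=MAX_BITS_PER_LETTER):
    """One minus the ratio of actual to maximum information per letter."""
    return 1.0 - bits_per_letter / max_bits


def bits_per_letter_at(redundancy, max_bits=MAX_BITS_PER_LETTER):
    """Inverse relation: information per letter implied by a redundancy."""
    return (1.0 - redundancy) * max_bits


if __name__ == "__main__":
    # Three-quarters redundancy versus the hypothetical 95% figure.
    for r in (0.75, 0.95):
        print(f"{r:.0%} redundant -> {bits_per_letter_at(r):.3f} bits per letter")
```

Because a high entropy means a low redundancy, an upper bound on entropy gives a lower bound on relative redundancy, and vice versa.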


Procedure. We followed Shannon's experimental procedure closely except that 
our sources of test-material were 10 paper-backed novels instead of a single volume. 
(Shannon's only source appears to have been Jefferson the Virginian by Dumas 
Malone.) The 10 novels were of about the same level of reading difficulty (average 
R.E. = 70.6) and abstraction (average R = 75.2) as measured by Flesch’s scales.²
From each source we selected 10 passages of each of 10 lengths: n-1 = 0, 1, 2, 4, 8,
16, 32, 64, 128, and approximately 10,000 letters. In every case, the page, line, 
and position in the line of the test-letter were chosen with the aid of a table of 
random numbers. 
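
A rough sketch of this sampling scheme follows. It is a simplification under stated assumptions: the source text is treated as one flat string rather than as pages and lines, a seeded pseudo-random generator stands in for the table of random numbers, and the long condition is taken as exactly 10,000 letters. All names are hypothetical.

```python
import random

# Context lengths n-1: 0, then powers of two up to 128, then about 10,000.
CONTEXT_LENGTHS = [0] + [2 ** k for k in range(8)] + [10_000]


def sample_test_items(text, context_len, count, rng):
    """Pick `count` random test-letters, each with `context_len` letters before it.

    Returns a list of (context, target) pairs. Spaces count as letters,
    as in the instructions given to the Ss.
    """
    if len(text) <= context_len:
        raise ValueError("text is too short for this context length")
    items = []
    for _ in range(count):
        # The test-letter must have at least context_len letters before it.
        pos = rng.randrange(context_len, len(text))
        items.append((text[pos - context_len:pos], text[pos]))
    return items


if __name__ == "__main__":
    rng = random.Random(1954)  # fixed seed so the draw can be repeated
    novel = "the quick brown fox jumps over the lazy dog " * 20  # stand-in text
    for context, target in sample_test_items(novel, 8, 3, rng):
        print(repr(context), "->", repr(target))
```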

Ten Ss were used, one for each novel. Each was either a graduate student or a 
psychological research technician. Considerable pains were taken with their selection, 
since we wished intelligent, highly verbal people who would approximate ideal
predictors and who would be intrigued by the guessing game. 

The Ss were told that they would be given passages of various lengths, removed 
from context, and that they should in each instance supply the next letter. They 
were instructed to count the space between two words as a letter. Selections of 128 
letters or fewer were dictated to the Ss. They wrote these selections down and 
studied them before attempting to guess what came next. The 10,000-letter passages 
were presented as printed, with adjacent letters masked out. Ss worked individually, 
each at his or her own pace. In all cases they were required to continue guessing 
until they had named the next letter correctly. All guesses were recorded. 
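
The scoring of a single trial can be sketched as follows: S's successive guesses form an ordered list, and the score is the position at which the correct letter appears. The 27-symbol alphabet follows the instruction to count the space as a letter. The fixed guessing order shown is a hypothetical stand-in that ignores context entirely; a human S would reorder the guesses for every passage.

```python
import string

ALPHABET = string.ascii_lowercase + " "  # 26 letters plus the space

# Hypothetical context-free guessing order, used only for illustration.
FIXED_ORDER = " etaoinshrdlucmfwypvbgkqjxz"


def guesses_until_correct(ranked_guesses, target):
    """Return how many guesses are made up to and including the correct one."""
    for count, guess in enumerate(ranked_guesses, start=1):
        if guess == target:
            return count
    raise ValueError("target never guessed; the ranking must cover the alphabet")


if __name__ == "__main__":
    passage = "the next lette"
    print(guesses_until_correct(FIXED_ORDER, "r"), "guesses after", repr(passage))
```

Because the alphabet is finite, a trial of this kind always ends within 27 guesses.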


Results. Upper and lower bounds on relative redundancy, derived by Shannon’s 
method from distributions of numbers of guesses in the 100 tests (10 samples × 10
Ss) for each value of n-1, are shown as solid lines and filled circles in Fig. 1. The
open circles give the results of a retest to check the inversion in the data at n-1 = 64.
Evidently the irregularity was a sampling fluctuation. 

For passages more than four letters long, our estimates of redundancy are somewhat
lower than those obtained by Shannon. Our results in the interval n = 1 to
n = 30 are quite similar, however, to those of Frick and Sumby in tests with
transcribed samples of radio messages between aircraft and control towers.³ Certainly
any differences in overall level of constraint can be ascribed to differences in type of 
material or differences among subjects. 


² R. Flesch, Measuring the level of abstraction, J. Appl. Psychol., 34, 1950,
384-390.

³ F. C. Frick and W. H. Sumby, Control tower language, J. Acoust. Soc. Amer.,
24, 1952, 595-596.

The curves of Fig. 1 level off at about n-1 = 32. The constraint imposed by
10,000 preceding letters is little greater than that imposed by 32. This does not 
necessarily mean that determining influences do not extend over long spans. It may 
be that knowledge of letters some distance removed will permit better-than-chance 
prediction. It appears, however, that remote preceding letters impose no marked 
constraints that are not imposed also by neighboring preceding letters. Written
English does not become more and more redundant as longer and longer sequences are
taken into account. 

The confounding of Ss with sources—the use of one novel with each S—was 
intentional but, as we now see it, unfortunate. In fact, the restriction of the source 
material in any way that allows S to learn of the restriction is to some extent
antagonistic to the aim of discovering how much long-range influences add to the 
constraint imposed by neighboring letters. It may be argued that much of what S 
gains from knowledge of the author’s style or from the subject matter of the source 
he gains as soon as he sees or figures out what the source is. The effect of this
argument is lessened, however, by the results of a postliminary experiment. Ten Ss were
able to predict the last letter of 10,001-letter sequences from known sources no 
better than the last letters of 33-letter sequences from comparable but unknown 
sources. 

We may conclude, therefore, that constraining influences in printed English may 
preclude two-thirds or four-fifths of the freedom inherent in haphazardness, but that 
they do not approach complete determination much more closely than that, even 
when long-range forces are included in the measure. This conclusion may probably 
be extended with good approximation from letters to phonemes of the spoken
language. It may be extended also, with at least fair approximation, to words and
sentences, which are probably more natural units. The approximation in the latter 
case is introduced by the fact that words and sentences are of unequal letter-lengths
(or phoneme-lengths), but the non-uniformity probably does not bias the relative 
redundancy very much. In any event, the only approach to estimating the redundancy 
of word and sentence sequences appears to be through letters or phonemes. Shannon's 
guessing-game technique can in principle be applied to words and sentences, but it 
leads to trials that never terminate. 


Summary. An experiment modeled after Shannon’s was conducted to determine 
the extent to which estimates of the redundancy of English texts are dependent upon 
the number of preceding letters known to S. Data obtained indicate that, while the 
estimate of relative redundancy increases as knowledge of the foregoing text is 
extended from zero to approximately 32 letters, increasing the known number of 
letters beyond 32 does not result in any notable rise. 